Abstract

We consider an extension of the setting of label ranking, in which the learner is allowed to make
predictions in the form of partial instead of total orders. Predictions of that kind are interpreted
as a partial abstention: If the learner is not sufficiently certain regarding the relative order of
two alternatives, it may abstain from this decision and instead declare these alternatives as being
incomparable. We propose a new method for learning to predict partial orders that improves on an
existing approach, both theoretically and empirically. Our method is based on the idea of
thresholding the probabilities of pairwise preferences between labels as induced by a predicted
(parameterized) probability distribution on the set of all rankings.

1 Introduction

In the setting of label ranking, a special type of preference learning problem, each instance $x$
from an instance space $\mathcal{X}$ is associated with a total order of a fixed set of class labels
$\mathcal{Y} = \{y_1, \ldots, y_M\}$, that is, a complete, transitive, and asymmetric relation
$\succ_x$ on $\mathcal{Y}$, where $y_i \succ_x y_j$ indicates that, for instance $x$, $y_i$ precedes
$y_j$ in the order. Since a ranking can be considered as a special type of preference relation, we
shall also say that $y_i \succ_x y_j$ indicates that $y_i$ is preferred to $y_j$ given the
instance $x$.

Formally, a total order $\succ_x$ can be identified with a permutation $\pi_x$ of the set
$\{1, \ldots, M\}$, such that $\pi_x(i)$ is the index $j$ of the class label $y_j$ on the $i$-th
position in the order (and hence $\pi_x^{-1}(j) = i$ the position of the $j$-th label). This
permutation thus encodes the (ground truth) order relation

$$ y_{\pi_x(1)} \succ_x y_{\pi_x(2)} \succ_x \cdots \succ_x y_{\pi_x(M)} . $$

We denote the class of permutations of $\{1, \ldots, M\}$ (the symmetric group of order $M$) by $\Omega$.
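
The correspondence between a ranking and its inverse (position) map can be illustrated with a
short sketch. It is only an illustration: the function names are hypothetical, and labels and
positions are indexed from 0 rather than from 1 as in the text.

```python
def positions_from_ranking(pi):
    """Invert a ranking.

    pi[i] is the index of the label placed on position i (0-based).
    The result maps each label index to its position, i.e. it plays
    the role of the inverse permutation.
    """
    inv = [0] * len(pi)
    for position, label in enumerate(pi):
        inv[label] = position
    return inv


def precedes(pi, i, j):
    """True if label i precedes label j in the ranking pi."""
    inv = positions_from_ranking(pi)
    return inv[i] < inv[j]


# Ranking y_2 > y_0 > y_3 > y_1, written with 0-based label indices.
pi = [2, 0, 3, 1]
inv = positions_from_ranking(pi)
```

Here `inv[1]` is the position of label 1, and `precedes(pi, 2, 1)` answers whether label 2 is
preferred to label 1 under this ranking.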

The goal in label ranking is to learn a "label ranker" in the form of an $\mathcal{X} \to \Omega$
mapping. As training data, a label ranker uses a set of instances $x_n$ ($n = 1, \ldots, N$),
together with preference information in the form of pairwise comparisons $y_i \succ_{x_n} y_j$ of
some labels in $\mathcal{Y}$, suggesting that instance $x_n$ prefers label $y_i$ to $y_j$.

Motivated by the idea of a reject option in classification, the authors in [3] introduced a variant
of the above setting in which the label ranker is allowed to partially abstain from a prediction.
More specifically, it is allowed to make predictions in the form of partial instead of total orders: If
the ranker is not sufficiently certain regarding the relative order of two alternatives and, therefore,
cannot reliably decide whether the former should precede the latter or the other way around, it may
abstain from this decision and instead declare these alternatives as being incomparable. Abstaining
in a consistent way, the ranker should of course still produce an asymmetric and transitive relation,
hence a partial order.

The approach in [3], despite being the first to address the problem of learning to predict partial
orders, still exhibits some disadvantages (see next section). In this paper, we therefore propose an
alternative method, or rather a modification, which is based on the idea of predicting partial orders
by thresholding parameterized probability distributions on rankings. Roughly speaking, by making
stronger model assumptions, this approach is able to avoid inconsistencies that may occur in [3],
and hence simplifies the construction of consistent partial order relations; see Section 3 for details.

Of course, despite being interesting from a theoretical point of view, these properties do not
guarantee a practical advantage in terms of prediction performance, especially in cases where the
model assumptions might be violated. Therefore, we complement our theoretical results by an
experimental study in which we compare our new method with the original approach of [3].

2 Previous Work 

The method in [3] consists of two main steps and can be considered as a pairwise approach in the
sense that, as a point of departure, a valued preference relation
$P : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ is produced, where $P(y_i, y_j)$ is interpreted as a
measure of support of the pairwise preference $y_i \succ y_j$. Support is commonly interpreted in
terms of probability, hence $P$ is assumed to be reciprocal, that is,
$P(y_i, y_j) = 1 - P(y_j, y_i)$ for all $y_i, y_j \in \mathcal{Y}$. Then, in a second step, a partial
order $Q$ is derived from $P$ via thresholding: $Q(y_i, y_j) = 1$ if $P(y_i, y_j) > q$ and
$Q(y_i, y_j) = 0$ otherwise, where $1/2 \le q < 1$ is a threshold. Thus, the idea is to predict only
those pairwise preferences that are sufficiently likely, while abstaining on pairs $(y_i, y_j)$ for
which the probability $P(y_i, y_j)$ is too close to 1/2.
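
As a minimal sketch of the thresholding step, the following snippet turns a reciprocal relation
into a 0/1 relation. The function name and the numerical values of `P` are made up for
illustration and are not taken from [3].

```python
def threshold_relation(P, q):
    """Return Q with Q[i][j] = 1 if P[i][j] > q and 0 otherwise.

    P is assumed to be reciprocal (P[i][j] + P[j][i] = 1 for i != j),
    and q is assumed to lie in [1/2, 1). Diagonal entries are ignored.
    """
    m = len(P)
    return [[1 if i != j and P[i][j] > q else 0 for j in range(m)]
            for i in range(m)]


# Hypothetical preference degrees for three labels.
P = [[0.5, 0.9, 0.6],
     [0.1, 0.5, 0.7],
     [0.4, 0.3, 0.5]]

Q_strict = threshold_relation(P, 0.65)  # abstains on the pair (0, 2)
Q_total = threshold_relation(P, 0.5)    # decides every pair
```

With the larger threshold the pair of labels 0 and 2 is left incomparable, because neither
`P[0][2]` nor `P[2][0]` exceeds 0.65.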

The first step of deriving the relation $P$ is realized in [3] by means of an ensemble learning
technique: Training an ensemble of standard label rankers, each of which provides a prediction in
the form of a total order, $P(y_i, y_j)$ is defined by the fraction of ensemble members voting for
$y_i \succ y_j$. Other possibilities are of course conceivable, and indeed, the only important point
to notice here is that the preference degrees $P(y_i, y_j)$ are essentially independent of each
other. Or, stated differently, they do not guarantee any specific properties of the relation $P$
except being reciprocal. For the relation $Q$ derived from $P$ via thresholding, this has two
important consequences:

• If the threshold q is not large enough, then Q may have cycles. Thus, not all thresholds in 
[0.5, 1) are actually feasible. In particular, if q = 0.5 cannot be chosen, this also implies 
that the method may not be able to predict a total order as a special case. 

• Even if Q does not have cycles, it is not guaranteed to be transitive. 
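
The first of these problems can be reproduced with a tiny, hypothetical ensemble. The sketch below
computes voting fractions from three total orders that form a Condorcet-style cycle and checks the
thresholded relation for cycles with Warshall's algorithm. All names are illustrative; this is not
the implementation used in [3].

```python
from fractions import Fraction


def vote_relation(rankings, m):
    """P[i][j] = fraction of rankings in which label i precedes label j."""
    P = [[Fraction(1, 2) if i == j else Fraction(0) for j in range(m)]
         for i in range(m)]
    for pi in rankings:
        pos = {label: k for k, label in enumerate(pi)}
        for i in range(m):
            for j in range(m):
                if i != j and pos[i] < pos[j]:
                    P[i][j] += Fraction(1, len(rankings))
    return P


def has_cycle(Q):
    """Detect a directed cycle by computing the transitive closure."""
    m = len(Q)
    R = [[bool(x) for x in row] for row in Q]
    for k in range(m):
        for i in range(m):
            for j in range(m):
                R[i][j] = R[i][j] or (R[i][k] and R[k][j])
    return any(R[i][i] for i in range(m))


# Three ensemble members whose votes are cyclic.
ensemble = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
P = vote_relation(ensemble, 3)
Q = [[1 if i != j and P[i][j] > Fraction(1, 2) else 0 for j in range(3)]
     for i in range(3)]
```

Each of the preferences 0 over 1, 1 over 2, and 2 over 0 receives two of three votes, so the
relation thresholded at q = 1/2 contains a cycle and a total order cannot be predicted.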

To overcome these problems, the authors devise an algorithm that finds the smallest feasible
threshold $q_{\min}$ and "repairs" a non-transitive relation $Q$ by replacing it with its transitive
closure. The complexity of this algorithm is $O(|\mathcal{Y}|^3)$.
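
To make the repair step concrete, here is a naive sketch that searches the candidate thresholds in
increasing order and returns the transitive closure of the first acyclic relation. It is an
assumption-laden illustration, not the algorithm of [3]: it recomputes the closure for every
candidate and is therefore slower than the complexity stated above, and all names are hypothetical.

```python
def closure(Q):
    """Transitive closure of a boolean relation (Warshall's algorithm)."""
    m = len(Q)
    R = [[bool(x) for x in row] for row in Q]
    for k in range(m):
        for i in range(m):
            for j in range(m):
                R[i][j] = R[i][j] or (R[i][k] and R[k][j])
    return R


def threshold(P, q):
    m = len(P)
    return [[i != j and P[i][j] > q for j in range(m)] for i in range(m)]


def repair(P):
    """Return (q_min, Q): smallest acyclic threshold and the closed relation.

    The thresholded relation only changes at values occurring in P, so it
    suffices to try 1/2 and the entries of P above 1/2. The largest candidate
    always yields the empty relation; if that candidate equals 1, it lies
    outside [1/2, 1) and no threshold in that interval is feasible.
    """
    m = len(P)
    candidates = sorted({0.5} | {P[i][j] for i in range(m) for j in range(m)
                                 if i != j and P[i][j] > 0.5})
    for q in candidates:
        R = closure(threshold(P, q))
        if not any(R[i][i] for i in range(m)):
            return q, R


# Hypothetical cyclic preference degrees.
P = [[0.5, 0.7, 0.4],
     [0.3, 0.5, 0.8],
     [0.6, 0.2, 0.5]]
q_min, Q = repair(P)
```

In this example the closure adds the preference of label 0 over label 2 although `P[0][2]` is only
0.4, which shows how a repaired relation can contradict the underlying preference degrees.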

3 Predicting Partial Orders based on Probabilistic Models 

In order to tackle the above problems, our idea is to restrict the relation $P$ so as to exclude the
possibility of cycles and violations of transitivity from the very beginning. To this end, we take
advantage of methods for label ranking that produce (parameterized) probability distributions over
$\Omega$ as predictions. Our main theoretical result is to show that thresholding pairwise
preferences induced by such distributions yields preference relations with the desired properties,
that is, partial order relations $Q$.

In [2], a label ranking method was proposed that produces predictions expressed in terms of the
Mallows model [5], a distance-based probability model belonging to the family of exponential
distributions. The standard Mallows model

$$ P(\pi \mid \theta, \pi_0) = \frac{\exp(-\theta D(\pi, \pi_0))}{\phi(\theta)} \tag{1} $$

is determined by two parameters: The ranking $\pi_0 \in \Omega$ is the location parameter (mode,
center ranking) and $\theta > 0$ is a spread parameter. Moreover, $D$ is a distance measure on
rankings, and the constant $\phi = \phi(\theta)$ is a normalization factor that depends on the spread
(but, provided the right-invariance of $D$, not on $\pi_0$). Obviously, the Mallows model assigns the
maximum probability to the center ranking $\pi_0$. The larger the distance $D(\pi, \pi_0)$, the
smaller the probability of $\pi$ becomes. The spread parameter $\theta$ determines how quickly the
probability decreases, i.e., how peaked the distribution is around $\pi_0$. For $\theta \to 0$, the
uniform distribution is obtained, while for $\theta \to \infty$, the distribution converges to the
one-point distribution that assigns probability 1 to $\pi_0$ and 0 to all other rankings.
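
For a small number of labels, the Mallows model can be evaluated by brute-force enumeration. The
sketch below assumes the Kendall distance (number of discordant label pairs) as the distance D,
which is one common choice but is not prescribed by the text; the function names are hypothetical.

```python
import math
from itertools import permutations


def kendall(pi, sigma):
    """Number of label pairs ordered differently in pi and sigma."""
    pos = {label: k for k, label in enumerate(sigma)}
    seq = [pos[label] for label in pi]
    return sum(1 for a in range(len(seq)) for b in range(a + 1, len(seq))
               if seq[a] > seq[b])


def mallows(center, theta):
    """Dictionary mapping every ranking to its Mallows probability."""
    perms = list(permutations(center))
    weights = [math.exp(-theta * kendall(pi, center)) for pi in perms]
    phi = sum(weights)  # normalization constant, depends only on theta
    return {pi: w / phi for pi, w in zip(perms, weights)}


center = (0, 1, 2)
dist = mallows(center, 1.0)       # peaked around the center ranking
uniform = mallows(center, 0.0)    # theta = 0 gives the uniform distribution
```

Enumeration costs M! evaluations and is only meant to make the definition tangible; it is not a
practical inference procedure.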

Alternatively, the Plackett-Luce (PL) model was used in [1]. This is a stagewise model, which is
specified by a parameter vector $v = (v_1, v_2, \ldots, v_M) \in \mathbb{R}_+^M$:

$$ P(\pi \mid v) = \prod_{i=1}^{M} \frac{v_{\pi(i)}}{v_{\pi(i)} + v_{\pi(i+1)} + \cdots + v_{\pi(M)}} \tag{2} $$

This model is a generalization of the well-known Bradley-Terry model for the pairwise comparison
of alternatives, which specifies the probability that "$a$ wins against $b$" in terms of
$P(a \succ b) = \frac{v_a}{v_a + v_b}$. Obviously, the larger $v_a$ in comparison to $v_b$, the
higher the probability that $a$ is chosen. Likewise, the larger the parameter $v_i$ in (2) in
comparison to the parameters $v_j$, $j \neq i$, the higher the probability that the label $y_i$
appears on a top rank. An intuitively appealing explanation of the PL model can be given in terms of
a vase model: If $v_i$ corresponds to the relative frequency of the $i$-th label in a vase filled
with labeled balls, then $P(\pi \mid v)$ is the probability to produce the ranking $\pi$ by randomly
drawing balls from the vase in a sequential way and putting the label drawn in the $k$-th trial on
position $k$ (unless the label was already chosen before, in which case the trial is annulled).
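
The stagewise structure of the PL model translates directly into code. The following sketch
evaluates the PL probability of a ranking and draws a ranking by sequentially sampling among the
labels not yet placed, which yields the same distribution as the vase model with annulled trials.
Function names and parameter values are assumptions made for illustration.

```python
import random


def pl_probability(pi, v):
    """PL probability of ranking pi (tuple of 0-based label indices)."""
    prob = 1.0
    remaining = sum(v[label] for label in pi)
    for label in pi:
        prob *= v[label] / remaining  # choose `label` among those left
        remaining -= v[label]
    return prob


def pl_sample(v, rng):
    """Draw a ranking by repeatedly choosing among the remaining labels."""
    labels = list(range(len(v)))
    ranking = []
    while labels:
        chosen = rng.choices(labels, weights=[v[l] for l in labels])[0]
        ranking.append(chosen)
        labels.remove(chosen)
    return tuple(ranking)


v = [3.0, 2.0, 1.0]                     # hypothetical PL parameters
p_top = pl_probability((0, 1, 2), v)    # (3/6) * (2/3) * (1/1)
sample = pl_sample(v, random.Random(0))
```

Only the ratios of the parameters matter, so rescaling `v` by a positive constant leaves all
probabilities unchanged.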

Given a probability distribution $P$ on the set of rankings $\Omega$, the probability of a pairwise
preference $y_i \succ y_j$ (and hence the corresponding entry in the preference relation $P$) can be
derived through marginalization:

$$ P(y_i, y_j) = P(y_i \succ y_j) = \sum_{\pi \in E(y_i, y_j)} P(\pi) , \tag{3} $$

where $E(y_i, y_j)$ denotes the set of linear extensions of the incomplete ranking $y_i \succ y_j$,
i.e., the set of all rankings $\pi \in \Omega$ in which $y_i$ precedes $y_j$. Our main theoretical
result states that thresholding the relation $P$ in (3) yields a proper partial order relation $Q$,
both for the Mallows and the PL model.
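
The marginalization can be spelled out by enumerating all rankings, which is feasible only for very
few labels. As a sanity check, the sketch compares the enumerated PL marginals with the
Bradley-Terry form v_i / (v_i + v_j), which is the known closed form of pairwise marginals under the
PL model. Names and parameter values are illustrative assumptions.

```python
from itertools import permutations


def pl_probability(pi, v):
    """PL probability of ranking pi (tuple of 0-based label indices)."""
    prob = 1.0
    remaining = sum(v[label] for label in pi)
    for label in pi:
        prob *= v[label] / remaining
        remaining -= v[label]
    return prob


def pairwise_marginals(dist, m):
    """P[i][j] = total probability of rankings in which i precedes j."""
    P = [[0.0] * m for _ in range(m)]
    for pi, p in dist.items():
        for a in range(m):
            for b in range(a + 1, m):
                P[pi[a]][pi[b]] += p  # pi[a] precedes pi[b] in this ranking
    return P


v = [4.0, 2.0, 1.0, 1.0]  # hypothetical PL parameters
dist = {pi: pl_probability(pi, v) for pi in permutations(range(len(v)))}
P = pairwise_marginals(dist, len(v))
max_gap = max(abs(P[i][j] - v[i] / (v[i] + v[j]))
              for i in range(len(v)) for j in range(len(v)) if i != j)
```

The quantity `max_gap` measures the largest deviation between the enumerated marginals and the
closed form and should be at the level of floating-point error.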

Theorem 1. Let $P$ in (3) be the Mallows model (1), with a distance $D$ having the so-called
transposition property, or the PL model (2). Moreover, let $Q$ be defined by the thresholded relation
$Q(y_i, y_j) = 1$ if $P(y_i, y_j) > q$ and $Q(y_i, y_j) = 0$ otherwise. Then $Q$ defines a proper
partial order relation for all $q \in [1/2, 1)$.

A distance $D$ on rankings is said to have the transposition property, if the following holds: Let
$\pi$ and $\pi'$ be rankings so that, in both of them, $y_i$ precedes $y_j$. Moreover, consider a
third ranking $\pi''$ identical to $\pi'$, except for a transposition of $y_i$ and $y_j$. Then,
$D(\pi, \pi') \le D(\pi, \pi'')$. Of course, this property is intuitively plausible, and indeed, it
is satisfied by most of the commonly used distance measures (see, e.g., [4]).

While the proof of the above theorem is rather straightforward for the PL model, it becomes less
obvious in the case of the Mallows model. In any case, it guarantees that a proper partial order
relation can be predicted by simple thresholding, and without the need for any further repair.
Moreover, the whole spectrum of threshold parameters $q \in [1/2, 1)$ can be used.
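
The statement of the theorem can be checked numerically on a small example. The sketch below is
only a sanity check under stated assumptions, not a proof: it uses a Mallows model with the Kendall
distance on four labels, computes the pairwise marginals by enumeration, and tests asymmetry and
transitivity of the thresholded relation for several thresholds. All names are hypothetical.

```python
import math
from itertools import permutations


def kendall(pi, sigma):
    """Number of label pairs ordered differently in pi and sigma."""
    pos = {label: k for k, label in enumerate(sigma)}
    seq = [pos[label] for label in pi]
    return sum(1 for a in range(len(seq)) for b in range(a + 1, len(seq))
               if seq[a] > seq[b])


def mallows_marginals(center, theta):
    """Pairwise marginals of a Mallows model, computed by enumeration."""
    m = len(center)
    perms = list(permutations(center))
    weights = [math.exp(-theta * kendall(pi, center)) for pi in perms]
    phi = sum(weights)
    P = [[0.0] * m for _ in range(m)]
    for pi, w in zip(perms, weights):
        for a in range(m):
            for b in range(a + 1, m):
                P[pi[a]][pi[b]] += w / phi
    return P


def threshold(P, q):
    m = len(P)
    return [[i != j and P[i][j] > q for j in range(m)] for i in range(m)]


def is_strict_partial_order(Q):
    """Check asymmetry (which implies irreflexivity) and transitivity."""
    r = range(len(Q))
    asymmetric = all(not (Q[i][j] and Q[j][i]) for i in r for j in r)
    transitive = all(Q[i][k] or not (Q[i][j] and Q[j][k])
                     for i in r for j in r for k in r)
    return asymmetric and transitive


P = mallows_marginals((0, 1, 2, 3), 0.7)
thresholds = (0.5, 0.6, 0.7, 0.8, 0.9)
results = [is_strict_partial_order(threshold(P, q)) for q in thresholds]
decided_at_half = sum(sum(row) for row in threshold(P, 0.5))
```

At q = 1/2 all six label pairs are decided, so the total order given by the center ranking is
recovered as a special case.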



4 Experiments 

As mentioned earlier, the alternative approach outlined above does not automatically imply a prac- 
tical advantage, especially since it makes strong model assumptions (in terms of the Mallows or PL 
model) that are not necessarily satisfied. Therefore, we complement our theoretical results by an 
empirical study, in which we analyze the tradeoff between correctness and completeness achieved 
by different methods. 

If a model is allowed to abstain from making predictions, it is expected to reduce its error rate. In
fact, it can trivially do so, namely by rejecting all predictions, in which case it avoids any mistake.
Clearly, this is not a desirable solution. Indeed, in the setting of prediction with reject option, there
is always a trade-off between two criteria: correctness on the one side and completeness on the other
side. An ideal learner is correct in the sense of making few mistakes, but also complete in the sense
of abstaining rarely. The two criteria are conflicting: increasing completeness typically comes along
with reducing correctness and vice versa, at least if the learner is effective in the sense that it abstains
from those decisions that are indeed most uncertain.

Figure 1: Trade-off between completeness and correctness for a label ranking variant of the UCI
benchmark data set VOWEL: Existing pairwise method (solid line) versus new approach based on
probabilistic models (dashed line).

As measures of correctness and completeness, we use those that were proposed in [3]. Correctness is
measured by the gamma rank correlation (between the true ranking and the predicted partial order),
and completeness is defined by one minus the (relative) number of pairwise comparisons on which
the model abstains.
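
Both measures can be computed from the counts of concordant and discordant label pairs. The sketch
below assumes the usual form of the gamma coefficient, (C - D) / (C + D), and measures completeness
as the share of decided pairs among all M(M - 1)/2 pairs. The function name, the example relation,
and the choice to return `None` when the model abstains on every pair are assumptions.

```python
def correctness_completeness(true_ranking, Q):
    """Gamma rank correlation and completeness of a predicted partial order.

    true_ranking: tuple of 0-based label indices, most preferred first.
    Q: asymmetric 0/1 relation, Q[i][j] = 1 if label i is predicted to
       precede label j; pairs with Q[i][j] = Q[j][i] = 0 are abstentions.
    """
    pos = {label: k for k, label in enumerate(true_ranking)}
    m = len(true_ranking)
    concordant = discordant = 0
    for i in range(m):
        for j in range(m):
            if i != j and Q[i][j]:
                if pos[i] < pos[j]:
                    concordant += 1
                else:
                    discordant += 1
    decided = concordant + discordant
    completeness = decided / (m * (m - 1) / 2)
    # Gamma is undefined if the model abstains on all pairs.
    correctness = (concordant - discordant) / decided if decided else None
    return correctness, completeness


# Hypothetical prediction: 0 > 1 and 0 > 2 are right, 3 > 2 is wrong.
truth = (0, 1, 2, 3)
Q = [[0, 1, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 0]]
correctness, completeness = correctness_completeness(truth, Q)
```

In this example two decided pairs agree with the true ranking and one disagrees, while three of the
six pairs are left undecided.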

The main conclusion that can be drawn from our results is that, as expected, our probabilistic
approach does indeed achieve a better trade-off between completeness and correctness, especially in
the sense that it spans a wider range of values for the former. Besides, we often observe that the
level of correctness is increased, too. A typical example of the completeness/correctness trade-off
is shown in Figure 1.